Feature Manipulation in Pandas

Here let's look at a different dataset that will allow us to really dive into some meaningful visualizations. This dataset is publicly available, but it is also part of a Kaggle competition.

You can get the data from here: https://www.kaggle.com/c/titanic-gettingStarted or you can use the code below to load the data from GitHub.

There are lots of IPython notebooks for exploring the Titanic data. Check them out and see if you like any better than this one!

When going through visualization options, I recommend the following steps:

  • Would you like the visual to be interactive?
    • Yes. Does it have a lot of data?
      • No: use plotly or bokeh
      • Yes: sub-sample and then use plotly/bokeh
    • No. Does seaborn have a built-in function for plotting?
      • Yes: use seaborn
      • No: Does Pandas support the visual?
        • Yes: use pandas
        • No: use low-level matplotlib
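The decision list above could be sketched as a small helper function (purely illustrative; the function and argument names are made up, not part of any library):

```python
def pick_plotting_tool(interactive, big_data=False,
                       seaborn_has_it=False, pandas_has_it=False):
    """Encode the decision list above; a rough guide, not a hard rule."""
    if interactive:
        return 'plotly/bokeh (sub-sample first)' if big_data else 'plotly/bokeh'
    if seaborn_has_it:
        return 'seaborn'
    if pandas_has_it:
        return 'pandas'
    return 'matplotlib'

print(pick_plotting_tool(interactive=False, seaborn_has_it=True))  # seaborn
```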

Along the way, we will look at several high-level plotting libraries.

Adding Dependencies (for Jupyter Lab)

  • conda install -c conda-forge missingno
  • conda install nodejs
  • jupyter labextension install @jupyterlab/plotly-extension

Loading the Titanic Data for Example Visualizations

In [1]:
# load the Titanic dataset
import pandas as pd
import numpy as np

print('Pandas:', pd.__version__)
print('Numpy:',np.__version__)

df = pd.read_csv('https://raw.githubusercontent.com/eclarson/DataMiningNotebooks/master/data/titanic.csv') # read in the csv file

df.head()
Pandas: 0.25.3
Numpy: 1.18.1
Out[1]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [2]:
# note that the describe function defaults to using only some variables
df.describe()
Out[2]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [3]:
print(df.dtypes)
print('===========')
print(df.info())
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
===========
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
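The info() output above reports non-null counts per column; pandas can also count the missing values directly with `isna().sum()`. A minimal sketch on a toy frame (a stand-in for the Titanic data, where Age and Cabin have NaNs):

```python
import pandas as pd
import numpy as np

# a tiny frame standing in for the Titanic data
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Fare':  [7.25, 71.28, 7.93, 53.10],
})

missing = toy.isna().sum()  # NaN count per column
print(missing)
```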

Questions we might want to ask:

  • What percentage of passengers survived the Titanic disaster?
  • What percentage survived in each class (first, coach, etc.)?
  • How many people traveled in each class? How many classes are there?
In [4]:
# the percentage of individuals that survived on the Titanic
sum(df.Survived==1)/len(df)*100.0
Out[4]:
38.38383838383838
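An equivalent, slightly more idiomatic form: since Survived is a 0/1 column, its mean already is the survival fraction. A quick check on a toy Series (a stand-in for `df.Survived`):

```python
import pandas as pd

s = pd.Series([1, 0, 0, 1, 0])        # stand-in for df.Survived
pct_a = sum(s == 1) / len(s) * 100.0  # the expression used above
pct_b = s.mean() * 100.0              # same result: mean of a 0/1 column
print(pct_a, pct_b)
```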

Grouping the Data

In [5]:
# Let's aggregate by passenger class and look at the group sizes
df_grouped = df.groupby(by='Pclass')
for val,grp in df_grouped:
    print('There were',len(grp),'people traveling in',val,'class.')
There were 216 people traveling in 1 class.
There were 184 people traveling in 2 class.
There were 491 people traveling in 3 class.
In [6]:
# an example of using the groupby function with a data column
print(df_grouped['Survived'].sum())
print('---------------------------------------')
print(df_grouped.Survived.count())
print('---------------------------------------')
print(df_grouped.Survived.sum() / df_grouped.Survived.count())

# might there be a better way of displaying this data?
Pclass
1    136
2     87
3    119
Name: Survived, dtype: int64
---------------------------------------
Pclass
1    216
2    184
3    491
Name: Survived, dtype: int64
---------------------------------------
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
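One answer to the comment above: because Survived is a 0/1 column, `mean()` on the grouped column gives the survival rate in a single step, with no explicit sum/count division. Sketched on toy data (toy is a stand-in for the Titanic frame):

```python
import pandas as pd

toy = pd.DataFrame({'Pclass':   [1, 1, 2, 2, 3, 3],
                    'Survived': [1, 1, 1, 0, 0, 0]})

# the mean of a 0/1 column IS the survival rate
rates = toy.groupby('Pclass').Survived.mean()
print(rates)
```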
In [7]:
# Class Exercise: Create code for calculating the std error
# std / sqrt(N) 
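One possible solution sketch for the exercise, on toy stand-in data rather than the actual Titanic frame:

```python
import numpy as np
import pandas as pd

# toy stand-ins for the grouped Titanic data
toy = pd.DataFrame({'Pclass':   [1, 1, 1, 2, 2, 2],
                    'Survived': [1, 0, 1, 0, 0, 1]})
grp = toy.groupby('Pclass').Survived

# standard error of the mean: std / sqrt(N), computed per group
std_err = grp.std() / np.sqrt(grp.count())
print(std_err)
```

Note that pandas also has a built-in `sem()` on grouped data, so `grp.sem()` computes the same quantity directly.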

Cleaning the Dataset

Let's start by visualizing the missing data in this dataset. We will use a visualization library called missingno, which has many types of visuals for seeing where a dataframe contains NaNs and how we might go about filling in the values. I particularly like the matrix visualization, but there are many more to explore:

Plot Type One: Missing Data Matrix

In [8]:
# this Python magic allows plots to be embedded in the notebook
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline 

import missingno as mn

mn.matrix(df.sort_values(by=["Cabin","Embarked","Age",]))
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a20cf2898>
In [9]:
# let's clean the dataset a little before moving on

if False: # skip this for now! just archiving it

    # 1. Remove attributes that just aren't useful for us
    for col in ['PassengerId','Name','Cabin','Ticket']:
        if col in df:
            del df[col]

    # 2. Impute some missing values, grouped by their Pclass and SibSp numbers
    df_grouped = df.groupby(by=['Pclass','SibSp', 'Sex'])

    # now use this grouping to fill the data set in each group, then transform back

    # 3. create new dataframe that fills groups with the median of that group
    func = lambda grp: grp.fillna(grp.median())
    df_imputed = df_grouped.transform(func)

    # 4. restore any deleted columns
    col_deleted = list( set(df.columns) - set(df_imputed.columns)) # in case the median operation deleted columns
    df_imputed[col_deleted] = df[col_deleted]

    # 5. drop rows that still had missing values after grouped imputation
    df_imputed.dropna(inplace=True)

    # 6. Rearrange the columns
    df_imputed = df_imputed[['Survived','Age','Sex','Parch','SibSp','Pclass','Fare','Embarked']]
In [10]:
# let's clean the dataset a little before moving on

# 1. Remove attributes that just aren't useful for us
for col in ['PassengerId','Name','Cabin','Ticket']:
    if col in df:
        del df[col]
        

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB
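As an aside, `DataFrame.drop` with `errors='ignore'` achieves the same thing as the del loop and skips columns that are already gone, which makes the cell safe to re-run. A small sketch on a toy frame (not the Titanic data):

```python
import pandas as pd

toy = pd.DataFrame({'PassengerId': [1, 2],
                    'Name': ['a', 'b'],
                    'Fare': [7.25, 71.28]})

# errors='ignore' skips names that are not present (e.g. 'Cabin' here),
# so re-running the cell does not raise a KeyError
toy = toy.drop(columns=['PassengerId', 'Name', 'Cabin', 'Ticket'],
               errors='ignore')
print(list(toy.columns))
```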
In [11]:
# impute based upon the K closest samples (rows)
from sklearn.impute import KNNImputer
import copy

# get object for imputation
knn_obj = KNNImputer(n_neighbors=5)

# create a numpy matrix from pandas numeric values to impute
temp = df[['Pclass','Age','SibSp','Parch','Fare']].to_numpy()

# use sklearn imputation object
knn_obj.fit(temp)
temp_imputed = knn_obj.transform(temp)
#    could have also done:
# temp_imputed = knn_obj.fit_transform(temp)

# this is VERY IMPORTANT, make a deep copy, not just a reference to the object
df_imputed = copy.deepcopy(df) # not just an alias
df_imputed[['Pclass','Age','SibSp','Parch','Fare']] = temp_imputed
df_imputed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null float64
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null float64
Parch       891 non-null float64
Fare        891 non-null float64
Embarked    889 non-null object
dtypes: float64(5), int64(1), object(2)
memory usage: 55.8+ KB
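To see what KNNImputer is doing, here is a tiny worked example on a synthetic matrix (not the Titanic data): the missing entry is filled with the mean of that column over the two nearest rows, where distance is computed from the columns that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer

# a tiny synthetic matrix: row 1 is missing its second value
X = np.array([[1.0, 2.0],
              [1.0, np.nan],
              [5.0, 8.0],
              [1.0, 2.2]])

imp = KNNImputer(n_neighbors=2)
X_filled = imp.fit_transform(X)

# rows 0 and 3 are the two nearest neighbors of row 1 (first column matches),
# so the NaN becomes mean(2.0, 2.2) = 2.1
print(X_filled[1, 1])
```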
In [12]:
# let's show some very basic plotting to be sure the data looks about the same

df_imputed.Age.plot(kind='hist',alpha=0.5)
df.Age.plot(kind='hist', alpha=0.5)

plt.show()

[back to slides]

Feature Discretization

This is an example of how to turn a continuous feature into an ordinal feature. Let's try to give some human intuition to a variable by grouping the data by age.

Question: Does age range influence survival rates?

In [13]:
# let's break up the age variable
df_imputed['age_range'] = pd.cut(df_imputed['Age'],[0,15,25,65,1e6],
                                 labels=['child','young adult','adult','senior']) # this creates a new variable
df_imputed.age_range.describe()
Out[13]:
count       891
unique        4
top       adult
freq        556
Name: age_range, dtype: object
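A small illustration of what `pd.cut` produces: an *ordered* categorical, with bins that are right-inclusive by default (so an age of exactly 15 falls in 'child', and 25 falls in 'young adult'):

```python
import pandas as pd

ages = pd.Series([4, 22, 40, 70, 15])
bins = pd.cut(ages, [0, 15, 25, 65, 1e6],
              labels=['child', 'young adult', 'adult', 'senior'])

# bins are half-open intervals (lo, hi], so 15 lands in 'child'
print(bins.tolist())
print(bins.cat.ordered)  # the labels carry an ordering
```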
In [14]:
# now let's group with the new variable
df_grouped = df_imputed.groupby(by=['Pclass','age_range'])
print ("Percentage of survivors in each group:")
print (df_grouped.Survived.sum() / df_grouped.Survived.count() *100)
Percentage of survivors in each group:
Pclass  age_range  
1.0     child           83.333333
        young adult     78.378378
        adult           59.763314
        senior          25.000000
2.0     child          100.000000
        young adult     41.304348
        adult           41.880342
        senior           0.000000
3.0     child           44.067797
        young adult     21.250000
        adult           21.851852
        senior           0.000000
Name: Survived, dtype: float64


Visualization in Python with Pandas, Matplotlib, and Others

In [15]:
# this Python magic allows plots to be embedded in the notebook
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline 

print('Matplotlib:', matplotlib.__version__)
# could also say "%matplotlib notebook" here to make things interactive
Matplotlib: 3.1.1

Visualizing the dataset

Pandas has plenty of plotting abilities built in. Let's take a look at a few of the different graphing capabilities of Pandas with only matplotlib. Afterward, we can make the visualizations more beautiful.

Visualization Techniques: Distributions

  • Histogram
    • Usually shows the distribution of values of a single variable
    • Divide the values into bins and show a bar plot of the number of objects in each bin.
  • Kernel Density Estimation
    • Place a Gaussian at each data point and add them up
    • The bandwidth (standard deviation) of the Gaussians plays the same role as the bin width in a histogram

[figure: KDE example]

Question: What were the ages of people on the Titanic?

Plot Type Two: Histogram and Kernel Density

In [16]:
# Start by just plotting what we previously grouped!
plt.style.use('ggplot')

fig = plt.figure(figsize=(15,5))

plt.subplot(1,3,1)
df_imputed.Age.plot.hist(bins=20)

plt.subplot(1,3,2)
df_imputed.Age.plot.kde(bw_method=0.2)

plt.subplot(1,3,3)
df_imputed.Age.plot.hist(bins=20)
df_imputed.Age.plot.kde(bw_method=0.1, secondary_y=True)
plt.ylim([0, 0.06])

plt.show()

Two-Dimensional Distributions

  • Estimate the joint distribution of the values of two attributes

    • Example: petal width and petal length
    • What does this tell us?

Question: How does age relate to the fare that was paid?

In [17]:
plt.hist2d(x=df_imputed.Age, y=df_imputed.Fare, bins=30)
plt.colorbar()
plt.xlabel("Age")
plt.ylabel("Fare")
plt.show()

The above plot is not all that meaningful. We can probably do better than visualizing the joint distribution with a 2D histogram. Let's face it: 2D histograms are bound to be sparse and not very descriptive. Instead, let's do something smarter.

Feature Correlation Plot

  • First, let's visualize the correlation between the different features.

Plot Type Three: Heatmap (of correlation)

In [18]:
# plot the correlation matrix 
vars_to_use = ['Survived', 'Age', 'Parch', 'SibSp', 'Pclass', 'Fare'] # pick vars
plt.pcolor(df_imputed[vars_to_use].corr()) # do the feature correlation plot

# fill in the indices
plt.yticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.xticks(np.arange(0.5, len(vars_to_use), 1), vars_to_use)
plt.colorbar()
plt.show()

Grouped Count Plots

Used when you have multiple categorical or nominal variables that you want to show together in sub-groups. Grouping here means displaying the counts of the different subgroups of the dataset. For the Titanic data, this can be quite telling.

Question: Does age, gender, or class have an effect on survival?

Plot Type Four: Grouped Bar Chart

In [19]:
# first group the data
df_grouped = df_imputed.groupby(by=['Pclass','age_range'])

# tabulate survival rates of each group
survival_rate = df_grouped.Survived.sum() / df_grouped.Survived.count()

# show in a bar chart using builtin pandas API
ax = survival_rate.plot(kind='barh')
plt.title('Survival Percentages by Class and Age Range')
plt.show()
In [20]:
# the cross tab operator provides an easy way to get these numbers
survival = pd.crosstab([df_imputed['Pclass'],
                        df_imputed['age_range']], # categories to cross tabulate
                       df_imputed.Survived.astype(bool)) # how to group
print(survival)

survival.plot(kind='bar', stacked=True)
plt.show()
Survived            False  True 
Pclass age_range                
1.0    child            1      5
       young adult      8     29
       adult           68    101
       senior           3      1
2.0    child            0     19
       young adult     27     19
       adult           68     49
       senior           2      0
3.0    child           33     26
       young adult    126     34
       adult          211     59
       senior           2      0
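`pd.crosstab` can also normalize the counts directly, which yields the survival rates without a separate division step. A sketch on toy data (a stand-in for the Titanic frame):

```python
import pandas as pd

toy = pd.DataFrame({'Pclass':   [1, 1, 2, 2, 2, 3],
                    'Survived': [1, 0, 1, 1, 0, 0]})

# normalize='index' converts each row of counts into proportions
rates = pd.crosstab(toy['Pclass'], toy['Survived'].astype(bool),
                    normalize='index')
print(rates)
```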
In [21]:
# plot overall cross tab with both groups
plt.figure(figsize=(15,3))
ax1 = plt.subplot(1,3,1)
ax2 = plt.subplot(1,3,2)
ax3 = plt.subplot(1,3,3)

pd.crosstab([df_imputed['Pclass']], # categories to cross tabulate
            df_imputed.Survived.astype(bool)).plot(kind='bar', stacked=True, ax=ax1) 

pd.crosstab([df_imputed['age_range']], # categories to cross tabulate
            df_imputed.Survived.astype(bool)).plot(kind='bar', stacked=True, ax=ax2) 

pd.crosstab([df_imputed['Sex']], # categories to cross tabulate
            df_imputed.Survived.astype(bool)).plot(kind='bar', stacked=True, ax=ax3) 

plt.show()

Sub-group Distribution Plots

  • Box Plots
    • Invented by J. Tukey
    • Another way of displaying the distribution of data
    • The following figure shows the basic parts of a box plot:

[figure: parts of a Tukey box plot]

Plot Type Five: Box Plot

In [22]:
ax = df_imputed.boxplot(column='Fare', by = 'Pclass') # group by class
plt.ylabel('Fare')
plt.title('')
ax.set_yscale('log') # so that the boxplots are not squished

The problem with box plots is that they can hide important aspects of the distribution. For example, the following figure shows several datasets that all have the exact same box plot.

[figure: different distributions with identical box plots]

Simplifying Plotting with Seaborn

Using pandas and matplotlib is great until you need to redo plots or make more intricate ones. Let's look at one or two APIs that might simplify our lives. First, let's use Seaborn.

  • import seaborn as sns

In seaborn, we have access to a number of different plotting tools. Let's take a look at:

Plot Type Six:

  • Box Plots
  • Swarm Plots
  • Violin Plots
In [23]:
import seaborn as sns
# cmap = sns.diverging_palette(220, 10, as_cmap=True) # one of the many color mappings

print('Seaborn:', sns.__version__)
# now try plotting some of the previous plots, way more visually appealing!!
Seaborn: 0.9.0
In [24]:
# sns boxplot
plt.subplots(figsize=(20, 5))

plt.subplot(1,3,1)
sns.boxplot(x="Sex", y="Age", hue="Survived", data=df_imputed)
plt.title('Boxplot Example')

plt.subplot(1,3,2)
sns.violinplot(x="Sex", y="Age", hue="Survived", data=df_imputed)
plt.title('Violin Example')

plt.subplot(1,3,3)
sns.swarmplot(x="Sex", y="Age", hue="Survived", data=df_imputed)
plt.title('Swarm Example')

plt.show()
In [25]:
# ASIDE: UGH so much repeated code, can we do "better"?
plt.subplots(figsize=(20, 5))
args = {'x':"Sex", 'y':"Age", 'hue':"Survived", 'data':df_imputed}
for i, plot_func in enumerate([sns.boxplot, sns.violinplot, sns.swarmplot]):
    plt.subplot(1,3,i+1)
    plot_func(**args) # more compact, LESS readable
    
plt.show()
In [26]:
sns.violinplot(x="Sex", y="Age", hue="Survived", data=df_imputed, 
               split=True, inner="quart")

plt.show()

Self Test 2a.2

[figure: self-test question]


Matrix Plots

  • Plot some data from a matrix
  • This can be useful when objects are sorted well
  • Typically, the attributes are normalized to prevent one attribute from dominating the plot
  • Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects
  • Two versions:

    • Feature Based
    • Instance Based

Question: Which features are most similar to each other?

In [27]:
# the correlation plot is Feature based because we get
# a place in the plot for each feature
# in this plot we are asking, what features are most correlated? 
sns.set(style="darkgrid") # one of the many styles to plot using
cmap = sns.diverging_palette(220, 10, as_cmap=True) # note: sns.set() returns None, so it cannot serve as a color map

f, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(df_imputed.corr(), cmap=cmap, annot=True)

f.tight_layout()

New Question: Which passengers are most similar to one another?

In [28]:
# but we could also be asking, what instances are most similar to each other?

# NOTE: Correlation here is defined as a distance metric by scipy 
# https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.correlation.html 
# it is defined as 1-CC, so '0' means highly correlated

from sklearn.metrics.pairwise import pairwise_distances

vars_to_use = [ 'Age', 'Pclass', 'Fare', 'SibSp','Parch'] # pick vars

xdata = pairwise_distances(df_imputed[vars_to_use].values, # get numpy matrix
                           metric='correlation')
sns.heatmap(xdata, cmap=cmap, annot=False)
print('What is wrong with this plot?')
What is wrong with this plot?
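A quick numeric check of the distance definition quoted above: scipy's `correlation` distance is 1 minus the Pearson correlation, so perfectly correlated vectors sit at distance 0 and perfectly anti-correlated ones at distance 2.

```python
import numpy as np
from scipy.spatial.distance import correlation

a = np.array([1.0, 2.0, 3.0])
b = 2 * a            # perfectly correlated with a
c = a[::-1].copy()   # perfectly anti-correlated with a

# correlation distance = 1 - Pearson r
print(correlation(a, b))   # ~0.0 (r = +1)
print(correlation(a, c))   # ~2.0 (r = -1)
```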
In [29]:
# let's fix a few things
# first, the distance between instances was dominated
#  by the largest-scale variable, Fare

from sklearn.preprocessing import StandardScaler

# let's scale the data to be zero mean, unit variance
std = StandardScaler()

xdata = pairwise_distances(std.fit_transform(df_imputed[vars_to_use].values), 
                           metric='correlation')
sns.heatmap(xdata, cmap=cmap, annot=False)
print('Is there still something wrong?')
Is there still something wrong?
In [30]:
f, ax = plt.subplots(figsize=(8, 7))

# let's scale the data to be zero mean, unit variance
std = StandardScaler()
# and let's also sort the data
df_imputed_copy = df_imputed.copy().sort_values(by=['Pclass','Age','Survived'])

xdata = pairwise_distances(std.fit_transform(df_imputed_copy[vars_to_use].values), 
                           metric='correlation')
sns.heatmap(xdata, cmap=cmap, annot=False)
print('Is there anything we can conclude?')
Is there anything we can conclude?

Revisiting other Plots in Seaborn

In [31]:
# can we make a better combined histogram and KDE?
sns.distplot(df_imputed.Age)
plt.show()
In [32]:
# let's make a pretty plot of the scatter matrix
df_imputed_jitter = df_imputed.copy()
df_imputed_jitter[['Parch','SibSp','Pclass']] += np.random.rand(len(df_imputed_jitter),3)/2 
sns.pairplot(df_imputed_jitter, hue="Survived", height=2, # 'size' was renamed to 'height' in newer seaborn
            plot_kws=dict(s=20, alpha=0.15, linewidth=0))
plt.show()
/Users/eclarson/anaconda3/envs/mlenv/lib/python3.6/site-packages/statsmodels/nonparametric/kde.py:487: RuntimeWarning: invalid value encountered in true_divide
  binned = fast_linbin(X, a, b, gridsize) / (delta * nobs)
/Users/eclarson/anaconda3/envs/mlenv/lib/python3.6/site-packages/statsmodels/nonparametric/kdetools.py:34: RuntimeWarning: invalid value encountered in double_scalars
  FAC1 = 2*(np.pi*bw/RANGE)**2

A Final Note on Plotting:

The best plots you can make are probably ones that are completely custom to the task or question you are trying to solve or answer. These plots are also the most difficult to get right because they take a great deal of iteration, time, and effort to perfect, and they take some time to explain. There is a delicate balance between creating a new plot that answers exactly what you are asking (in the best way possible) and spending an inordinate amount of time on a new plot (when a standard plot might be a "pretty good" answer).



Revisiting with Interactive Visuals: Plotly

More updates to come in this section of the notebook. Plotly is a major step in the direction of using JavaScript and Python together, and I would argue it has a much better implementation than other packages.

In [33]:
# directly from the getting started example...
import plotly
print('Plotly:', plotly.__version__)

plotly.offline.init_notebook_mode() # run at the start of every notebook
plotly.offline.iplot({
    "data": [{
        "x": [1, 2, 3],
        "y": [4, 2, 5]
    }],
    "layout": {
        "title": "hello world"
    }
})
Plotly: 3.1.1
In [34]:
from plotly.graph_objs import Scatter, Layout
from plotly.graph_objs.scatter import Marker
from plotly.graph_objs.layout import XAxis, YAxis
# let's manipulate the example to serve our purposes

# plotly allows us to create JS graph elements, like a scatter object
plotly.offline.iplot({
    'data':[
        Scatter(x=df_imputed.SibSp.values+np.random.rand(*df_imputed.SibSp.shape)/2,
                y=df_imputed.Age,
      
                text=df_imputed.Survived.values.astype(str),
                marker=Marker(size=df_imputed.Fare, sizemode='area', sizeref=1,),
                mode='markers')
            ],
    'layout': Layout(xaxis=XAxis(title='Sibling and Spouses'), 
                     yaxis=YAxis(title='Age'),
                     title='Age and Family Size (Marker Size==Fare)')
}, show_link=False)

Visualizing more than three attributes requires a good deal of thought. In the following graph, let's use interactivity to help bolster the analysis. We will create a graph with custom text overlays that identify the passenger we are looking at. We will:

  • color-code whether they survived
  • scatter plot Age against social class
  • encode the number of siblings/spouses traveling with them in the size of the marker
In [35]:
def get_text(df_row):
    return 'Age: %d<br>Class: %d<br>Fare: %.2f<br>SibSpouse: %d<br>ParChildren: %d'%(df_row.Age,df_row.Pclass,df_row.Fare,df_row.SibSp,df_row.Parch)

df_imputed['text'] = df_imputed.apply(get_text,axis=1)
textstring = ['Perished','Survived', ]

plotly.offline.iplot({
    'data': [ # creates a list using a comprehension
        Scatter(x=df_imputed.Pclass[df_imputed.Survived==val].values+np.random.rand(*df_imputed.SibSp[df_imputed.Survived==val].shape)/2,
                y=df_imputed.Age[df_imputed.Survived==val],
                text=df_imputed.text[df_imputed.Survived==val].values.astype(str),
                marker=Marker(size=df_imputed[df_imputed.Survived==val].SibSp, sizemode='area', sizeref=0.01,),
                mode='markers',
                name=textstring[val]) for val in [0,1]
    ],
    'layout': Layout(xaxis=XAxis(title='Social Class'), 
                     yaxis=YAxis(title='Age'),
                     title='Age and Class Scatter Plot, Size = number of siblings and spouses'),
    
}, show_link=False)

Check out the plotly documentation for many more examples.

In this notebook you learned:

  • How to read in from a file using pandas
  • How to manipulate data with basic operations in pandas
  • How to group data in pandas
  • How to use Scikit-learn for imputation
  • Some common visualizations in Pandas, Seaborn, and Plotly

Todo: create and use some Bokeh examples here

In [ ]: